RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
2 Beijing Academy of Artificial Intelligence
3 Institute for Brain and Intelligence, Fudan University
4 University of Science and Technology Beijing
5 Beijing Innovation Center of Humanoid Robotics

*Equal contribution, Project leader, Corresponding author

Overview of RoboBench. We evaluate MLLMs as embodied brains across 5 dimensions, 14 capabilities, and 25 tasks, with tasks color-coded by dimension (top left). The dimensions follow the embodied execution pipeline (bottom): understanding intent, perceiving the environment, planning and adapting actions, refining subgoals via affordances, and diagnosing failures. Together they capture the core cognitive roles of System 2. The performance comparison (top middle) reveals significant gaps among state-of-the-art MLLMs, and RoboBench scores correlate strongly with downstream VLA performance on CALVIN (top right).

Abstract

Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential for advancing robotic intelligence. Yet existing benchmarks emphasize execution success, or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the distinct cognitive roles required across the full manipulation pipeline, RoboBench defines five dimensions (Instruction Comprehension, Perception Reasoning, Generalized Planning, Affordance Prediction, and Failure Analysis) spanning 14 capabilities, 25 tasks, and 6,092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, multi-view scenes, and memory-driven navigation, drawing from large-scale real robotic data and in-house collection. For planning, RoboBench introduces an evaluation framework that uses an MLLM as a world simulator. It moves beyond symbolic matching to evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes under physical and visual constraints, enabling faithful assessment of long-horizon reasoning. Experiments on 18 state-of-the-art MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis.
We further analyze how embodied cognitive abilities relate to downstream robotic control. RoboBench provides a comprehensive scaffold to quantify high-level cognition, clarify the role of the embodied brain, and guide the development of next-generation MLLMs for more robust robotic intelligence.

News

🎉 2026.07 - RoboBench has been accepted to ECCV 2026! This page has been updated to the camera-ready version: the leaderboard now covers 18 state-of-the-art MLLMs (GPT-5.4 / GPT-5, Claude-Opus-4.7 / Sonnet-4.6, Gemini-3.1-Pro, Qwen3-VL, MiMo-Embodied, RoboBrain-2.5, and more), evaluated with the MLLM-as-world-simulator planning framework.
🔥 2025.10.23 - Dataset and code have been released! If you encounter any problems, please open an issue on GitHub and we will address it as soon as possible❗️
🔥 2025.10.21 - The paper has been released! Code and dataset are being organized and will be released soon. Stay tuned❗️

Highlight

🔍 Benchmark Overview
  • RoboBench systematically evaluates MLLMs as embodied brains rather than only measuring final robot execution success.
  • The benchmark spans 5 core dimensions, 14 capabilities, 25 task types, and 6,092 high-quality QA pairs.
🧭 Embodied Execution Pipeline
  • Tasks cover the full manipulation reasoning pipeline: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis.
🛡️ Planning Beyond Text Matching
  • Long-horizon planning is evaluated with a DAG-guided MLLM-as-world-simulator that checks action alignment and task completion under physical and visual constraints.
  • Q2 and Q3 further test next-step prediction and task-state estimation, making the planning evaluation interpretable at the step level.
🧠 Real-world Data
  • Data are curated from open-source real-robot datasets and in-house collection across diverse embodiments, objects, views, and navigation scenarios.
  • Dimension-specific validation and human review keep QA items grounded in the visible scene and executable task context.
🌍 Broad Model Audit and Downstream Signal
  • The camera-ready leaderboard covers 18 state-of-the-art closed-source, open-source, and embodied MLLMs, plus a text-only ablation.
  • RoboBench scores also correlate with downstream VLA success, linking high-level embodied cognition to robot-control transfer.

Demo Case

Demo case. A representative RoboBench item illustrates how a robot scene is converted into grounded questions and evaluated across embodied reasoning skills, including perception, planning, affordance prediction, and failure analysis.

Leaderboard

Perception Reasoning
Column groups (left to right): Robotic-centric, Object-centric, Scene-centric, Task-centric
Columns (left to right): Model; Robot-type; Robot-view; Static Attr.; Functional Attr.; Spatial Relation; Temp. Grounding; Causality; Refer. Comprehen.; Avg
Basic Reference
Human Evaluation 80.67 79.08 43.77 83.89 70.91 51.61 91.22 93.22 74.30
GPT-5.4-text-only 25.86 28.26 8.81 45.57 32.67 22.90 34.48 18.40 27.12
Closed-Source MLLMs
GPT-5.4 73.28 50.00 42.86 73.42 54.46 38.93 45.52 71.17 56.20
GPT-5.2 68.10 39.86 38.60 77.22 47.52 30.53 54.48 71.17 53.44
GPT-5 64.66 47.10 49.24 69.62 54.46 48.09 74.48 78.53 60.77
GPT-4.1 66.38 50.00 40.43 68.35 47.52 22.14 56.55 73.01 53.05
GPT-4o 75.00 39.13 18.24 60.76 49.50 22.14 43.45 55.21 45.43
Claude-Opus-4.7 76.72 53.62 57.14 81.01 48.51 46.56 65.52 71.78 62.61
Claude-Sonnet-4.6 53.45 47.83 53.80 69.62 52.48 29.01 57.24 69.33 54.09
Claude-Sonnet-4.5 46.55 33.33 37.08 72.15 48.51 33.59 51.72 36.81 44.97
Claude-Haiku-4.5 44.83 33.33 30.70 56.96 25.74 22.14 45.52 27.61 35.85
Gemini-3.1-Pro 71.55 49.28 66.26 78.48 61.90 31.93 88.97 90.18 67.32
Gemini-2.5-Pro 67.24 43.48 57.14 82.28 57.43 52.07 73.10 80.37 64.14
Gemini-2.5-Flash 66.38 34.78 57.75 74.68 55.45 34.92 75.17 76.69 59.48
Open-Source Multi-Image MLLMs
Qwen3-VL-8B 52.59 36.96 27.66 65.82 36.63 25.95 31.72 54.60 41.49
Qwen2.5-VL-7B-Ins 37.07 23.19 24.32 56.96 26.73 22.14 33.10 34.36 32.23
LLaVA-OneVision-7B 31.03 26.81 39.21 68.35 42.57 18.32 33.79 50.92 38.88
Embodied MLLMs
RoboBrain-2.0-7B 31.90 19.57 28.57 44.30 34.65 21.37 24.83 33.13 29.79
RoboBrain-2.5-4B 35.34 24.64 39.82 77.22 53.47 18.32 59.31 44.17 44.04
MiMo-Embodied-7B 25.86 21.74 32.22 65.82 49.50 24.43 57.93 43.56 40.13
Instruction Comprehension and Generalized Planning (Q1)
Columns (left to right): Model; Instruction Comprehension: Explicit, Implicit, Avg; Generalized Planning (Q1): Cross-Embodiment Planning (Single-arm, Dual-arm, Mobile-manip., Human), Cross-Object Planning (Material Afford., Physical Attr., World Knowl.), Cross-View Planning (Multi, Single), Cross-Task Planning (Navigation Plan.), Avg
Basic Reference
Human Evaluation 59.94 61.13 60.54 72.50 41.93 41.55 62.28 56.70 58.98 49.36 52.82 51.59 45.23 54.50
GPT-5.4-text-only 74.88 38.54 56.71 83.53 66.47 74.71 62.65 76.03 80.36 66.95 71.33 72.41 47.66 73.74
Closed-Source MLLMs
GPT-5.4 74.58 50.80 62.69 85.56 70.28 75.59 56.26 78.64 64.83 72.44 73.33 72.50 53.92 70.91
GPT-5.2 75.90 48.85 62.38 86.92 70.00 80.38 60.08 81.60 64.62 75.49 75.84 71.22 55.03 72.31
GPT-5 77.63 54.71 66.17 84.25 69.48 81.27 70.13 81.58 59.80 71.95 70.95 72.76 58.29 71.84
GPT-4.1 76.23 57.30 66.77 88.08 68.56 78.17 55.38 81.37 65.76 64.88 73.70 71.83 52.94 71.79
GPT-4o 74.22 54.90 64.56 86.02 66.40 76.44 62.31 80.86 72.63 73.78 64.50 65.63 57.62 73.46
Claude-Opus-4.7 73.92 61.52 67.72 89.43 71.98 84.30 68.53 84.56 67.70 70.00 75.06 72.06 59.53 75.54
Claude-Sonnet-4.6 79.94 61.38 70.66 88.89 73.93 84.42 66.19 84.76 79.10 73.54 80.55 77.90 67.06 79.38
Claude-Sonnet-4.5 76.65 53.62 65.13 89.11 75.06 81.70 64.27 85.10 69.70 76.22 82.17 75.06 59.52 76.62
Claude-Haiku-4.5 73.78 42.88 58.33 86.13 74.07 76.63 60.38 81.37 58.93 71.75 78.48 71.73 50.87 71.01
Gemini-3.1-Pro 73.25 59.90 66.58 80.64 69.85 79.90 50.13 73.11 71.66 69.63 74.64 74.39 57.14 70.71
Gemini-2.5-Pro 76.20 60.96 68.58 83.53 69.08 84.31 59.16 76.72 66.93 77.68 73.57 75.24 58.14 72.93
Gemini-2.5-Flash 71.45 49.90 60.67 83.58 69.41 81.06 58.43 75.72 70.76 74.88 72.14 72.65 55.08 70.98
Open-Source Multi-Image MLLMs
Qwen3-VL-8B 59.46 30.80 45.13 74.49 44.54 57.98 52.25 63.88 54.66 55.49 49.15 58.75 37.22 56.71
Qwen2.5-VL-7B-Ins 56.04 23.90 39.97 73.89 30.19 56.06 53.85 58.90 57.78 53.90 25.83 37.50 11.95 49.92
LLaVA-OneVision-7B 38.25 10.61 24.43 54.87 31.05 35.88 43.99 37.59 51.37 30.00 31.43 36.60 25.11 41.02
Embodied MLLMs
RoboBrain-2.0-7B 43.54 21.10 32.32 62.49 30.16 44.42 42.90 46.62 52.87 45.24 31.25 32.69 25.98 45.12
RoboBrain-2.5-4B 36.30 16.65 26.47 39.32 23.99 45.87 54.53 31.69 29.16 24.39 28.39 25.75 23.97 31.85
MiMo-Embodied-7B 66.87 37.30 52.09 82.20 37.11 61.76 63.03 73.05 66.85 70.88 58.95 43.88 28.71 62.72
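Instruction Comprehension and Generalized Planning (Q2, Q3)
Columns (left to right): Model; Instr. Compre.: Explicit Goal (Q2, Q3); Generalized Planning: Single Arm (Q2, Q3), Material Afford. (Q2, Q3), World Knowl. (Q2, Q3)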
Basic Reference
Human Evaluation 45.28 74.32 27.52 71.35 43.62 71.20 43.89 69.83
GPT-5.4-text-only 36.98 46.25 40.98 52.86 40.43 52.33 43.01 41.46
Closed-Source MLLMs
GPT-5.4 49.48 62.50 48.20 67.85 44.38 64.67 42.19 51.22
GPT-5.2 39.32 75.00 42.86 73.02 41.07 66.67 37.50 56.10
GPT-5 44.09 72.97 47.26 75.75 44.38 62.83 39.58 63.41
GPT-4.1 45.31 70.00 48.62 63.76 44.32 63.67 39.58 58.54
GPT-4o 42.86 65.00 43.94 59.95 41.23 55.33 42.19 51.22
Claude-Opus-4.7 44.53 67.50 51.08 65.12 45.97 63.00 45.31 63.41
Claude-Sonnet-4.6 43.75 70.00 43.22 66.21 43.90 64.33 39.58 60.98
Claude-Sonnet-4.5 41.67 61.25 44.73 56.68 41.99 54.00 41.15 48.78
Claude-Haiku-4.5 34.13 62.50 43.07 60.38 38.14 63.18 31.77 70.73
Gemini-3.1-Pro 44.79 71.25 50.65 64.58 46.97 67.67 26.56 65.85
Gemini-2.5-Pro 37.76 72.50 52.02 71.66 49.20 70.00 35.42 68.29
Gemini-2.5-Flash 47.66 77.50 49.85 54.22 44.81 68.83 45.31 68.29
Open-Source Multi-Image MLLMs
Qwen3-VL-8B 49.74 62.50 50.94 61.58 44.80 55.50 36.98 56.10
Qwen2.5-VL-7B-Ins 29.43 55.00 31.88 52.59 30.17 49.33 20.31 51.22
LLaVA-OneVision-7B 33.59 41.25 35.06 46.05 35.97 40.50 34.90 43.90
Embodied MLLMs
RoboBrain-2.0-7B 34.90 52.50 33.84 53.95 34.29 49.00 27.60 48.78
RoboBrain-2.5-4B 32.81 58.75 37.37 54.50 35.25 55.17 32.29 56.10
MiMo-Embodied-7B 38.17 55.00 43.87 55.31 42.90 51.50 46.88 63.41
Affordance Prediction and Failure Analysis
Columns (left to right): Model; Affordance Prediction: Static, Dynamic, Naviga., Avg; Failure Analysis: Execution, Planning, Avg
Basic Reference
Human Evaluation 86.08 80.02 81.85 82.63 47.30 80.67 63.99
GPT-5.4-text-only 23.81 27.52 25.51 25.61 11.92 32.64 22.28
Closed-Source MLLMs
GPT-5.4 44.22 36.91 58.16 46.43 26.49 65.97 46.23
GPT-5.2 43.54 39.60 47.96 43.70 26.49 68.06 47.27
GPT-5 62.59 49.66 62.24 58.16 19.87 80.56 50.21
GPT-4.1 29.93 42.95 68.37 47.08 20.53 70.83 45.68
GPT-4o 40.82 42.28 50.00 44.37 31.79 57.64 44.71
Claude-Opus-4.7 53.74 62.42 79.59 65.25 14.57 72.22 43.40
Claude-Sonnet-4.6 37.41 52.35 41.84 43.87 17.88 77.78 47.83
Claude-Sonnet-4.5 34.69 38.93 53.06 42.23 14.57 63.19 38.88
Claude-Haiku-4.5 27.89 26.17 21.43 25.16 17.22 45.83 31.53
Gemini-3.1-Pro 82.31 77.85 96.94 85.70 25.17 80.74 52.95
Gemini-2.5-Pro 65.99 61.07 93.88 73.65 18.54 74.29 46.41
Gemini-2.5-Flash 61.22 69.80 36.73 55.92 25.83 65.49 45.66
Open-Source Multi-Image MLLMs
Qwen3-VL-8B 23.81 17.45 22.45 21.24 22.52 55.56 39.04
Qwen2.5-VL-7B-Ins 18.37 31.54 26.53 25.48 13.91 35.42 24.66
LLaVA-OneVision-7B 38.78 33.56 66.33 46.22 20.53 31.25 25.89
Embodied MLLMs
RoboBrain-2.0-7B 31.97 27.52 31.63 30.37 15.23 40.28 27.75
RoboBrain-2.5-4B 50.34 21.48 72.45 48.09 43.71 46.53 45.12
MiMo-Embodied-7B 51.70 36.91 70.41 53.01 19.21 42.36 30.78
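
Each leaderboard row above is a model name followed by space-separated scores. As a reading aid, the sketch below parses a few rows copied from the tables and checks that the Avg columns for Perception Reasoning, Affordance Prediction, and Failure Analysis equal the plain arithmetic mean of the task scores before them. The helper names `parse_row` and `is_mean` are illustrative and are not part of the RoboBench code release.

```python
def parse_row(line: str, n_values: int) -> tuple[str, list[float]]:
    """Split 'Model Name v1 ... vn' into the model name and its n numeric values."""
    parts = line.rsplit(maxsplit=n_values)
    return parts[0], [float(v) for v in parts[1:]]


def is_mean(scores: list[float], reported: float, tol: float = 0.01) -> bool:
    """True if `reported` matches the arithmetic mean of `scores` up to rounding."""
    return abs(sum(scores) / len(scores) - reported) <= tol


# Rows copied from the Perception Reasoning table: 8 task scores, then Avg.
perception_rows = [
    "Human Evaluation 80.67 79.08 43.77 83.89 70.91 51.61 91.22 93.22 74.30",
    "GPT-5.4 73.28 50.00 42.86 73.42 54.46 38.93 45.52 71.17 56.20",
    "Gemini-3.1-Pro 71.55 49.28 66.26 78.48 61.90 31.93 88.97 90.18 67.32",
]

# Rows copied from the Affordance Prediction / Failure Analysis table:
# Static, Dynamic, Naviga., Avg, Execution, Planning, Avg.
affordance_failure_rows = [
    "GPT-5.4 44.22 36.91 58.16 46.43 26.49 65.97 46.23",
    "RoboBrain-2.5-4B 50.34 21.48 72.45 48.09 43.71 46.53 45.12",
]

perception_ok = {}
for row in perception_rows:
    name, values = parse_row(row, 9)
    perception_ok[name] = is_mean(values[:8], values[8])

affordance_failure_ok = {}
for row in affordance_failure_rows:
    name, v = parse_row(row, 7)
    affordance_failure_ok[name] = is_mean(v[0:3], v[3]) and is_mean(v[4:6], v[6])

print(perception_ok)
print(affordance_failure_ok)
```

The same check does not hold for the Generalized Planning (Q1) average: for GPT-5.4 the ten sub-scores average to about 70.3, while the table reports 70.91, so that column is presumably aggregated differently (for example, weighted by item count).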

Key Findings from RoboBench Evaluation

Overall Findings

🥇 Large Capability Gaps, yet the Frontier Keeps Advancing

Gemini-3.1-Pro shows the most consistent advantages across perception, affordance, and failure dimensions—67.32 in perception reasoning (vs. the next-best 64.14 of Gemini-2.5-Pro), 85.70 in affordance prediction, and 52.95 in failure analysis—leaving only a modest residual gap to the human reference (74.30 in perception), while most other MLLMs remain highly uneven or generally weak.

🔒 Closed-Source Models Lead by ~20 Points

Closed-source MLLMs lead open-source ones in every dimension, by about 20 points on average (~50% relative). The gap is widest in instruction comprehension (~28) and generalized planning (~25) and narrowest in failure analysis (~13). Within the same family, performance generally improves with model size and generation.
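
The gap figures can be recomputed from the Avg columns of the leaderboard. The sketch below makes two assumptions that the page does not state: the "open-source" group pools the three open-source multi-image MLLMs with the three embodied MLLMs, and the planning dimension is represented by the Generalized Planning (Q1) average. Variable names are illustrative.

```python
DIMS = ["perception", "instruction", "planning_q1", "affordance", "failure"]

# Per-model dimension averages copied from the leaderboard Avg columns, in DIMS order.
closed = {
    "GPT-5.4": [56.20, 62.69, 70.91, 46.43, 46.23],
    "GPT-5.2": [53.44, 62.38, 72.31, 43.70, 47.27],
    "GPT-5": [60.77, 66.17, 71.84, 58.16, 50.21],
    "GPT-4.1": [53.05, 66.77, 71.79, 47.08, 45.68],
    "GPT-4o": [45.43, 64.56, 73.46, 44.37, 44.71],
    "Claude-Opus-4.7": [62.61, 67.72, 75.54, 65.25, 43.40],
    "Claude-Sonnet-4.6": [54.09, 70.66, 79.38, 43.87, 47.83],
    "Claude-Sonnet-4.5": [44.97, 65.13, 76.62, 42.23, 38.88],
    "Claude-Haiku-4.5": [35.85, 58.33, 71.01, 25.16, 31.53],
    "Gemini-3.1-Pro": [67.32, 66.58, 70.71, 85.70, 52.95],
    "Gemini-2.5-Pro": [64.14, 68.58, 72.93, 73.65, 46.41],
    "Gemini-2.5-Flash": [59.48, 60.67, 70.98, 55.92, 45.66],
}

# Assumption: open-source multi-image and embodied MLLMs are pooled into one group.
open_models = {
    "Qwen3-VL-8B": [41.49, 45.13, 56.71, 21.24, 39.04],
    "Qwen2.5-VL-7B-Ins": [32.23, 39.97, 49.92, 25.48, 24.66],
    "LLaVA-OneVision-7B": [38.88, 24.43, 41.02, 46.22, 25.89],
    "RoboBrain-2.0-7B": [29.79, 32.32, 45.12, 30.37, 27.75],
    "RoboBrain-2.5-4B": [44.04, 26.47, 31.85, 48.09, 45.12],
    "MiMo-Embodied-7B": [40.13, 52.09, 62.72, 53.01, 30.78],
}


def dim_means(group: dict[str, list[float]]) -> list[float]:
    """Unweighted mean over models for each dimension."""
    n = len(group)
    return [sum(scores[i] for scores in group.values()) / n for i in range(len(DIMS))]


closed_mean = dim_means(closed)
open_mean = dim_means(open_models)

gaps = {d: c - o for d, c, o in zip(DIMS, closed_mean, open_mean)}
mean_gap = sum(gaps.values()) / len(gaps)
relative_gap = mean_gap / (sum(open_mean) / len(open_mean))

for dim in sorted(gaps, key=gaps.get, reverse=True):
    print(f"{dim:12s} gap = {gaps[dim]:5.2f}")
print(f"mean gap = {mean_gap:.2f} points, relative = {relative_gap:.0%}")
```

Under this grouping the per-dimension gaps come out at roughly 28 (instruction), 25 (planning), 17 (perception), 15 (affordance), and 13 (failure), which is consistent with the figures quoted above.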

🧭 Plans Sound Plausible but Break at Execution

Planning failures are dominated by a perception-action gap: 45% are execution errors from missing or incorrect action sequences, while 24% are identification errors, 25% commonsense or physical-constraint errors, and 6% mode-specific format errors. Current MLLMs often reason plausibly but still fail to produce executable embodied actions.

👀 Vision Is Truly Required, Not Commonsense Recall

A text-only baseline (GPT-5.4 without images) stays close to random on perception (27.12) and affordance (25.61) tasks, far below the best vision-conditioned MLLM (67.32 / 85.70). RoboBench questions demand grounding in the observed scene—visual state, embodiment, and physical feasibility.

Fine-grained Findings

🧠 Implicit Intent Understanding Remains a Major Challenge

Even the strongest explicit-goal model (Claude-Sonnet-4.6) drops from 79.94 to 61.38 when instructions become implicit, and the gap widens for weaker MLLMs. A paired chain-of-thought rewriting ablation shows this is a genuine intent-grounding limitation, not a prompting artifact.

👁️ Perception Bottlenecks: Embodiment and Time

Models handle object attributes well (up to 82.28 on functional attributes) but struggle with robotic perception and spatiotemporal reasoning: the best scores are only 53.62 on robot-view understanding and 52.07 on temporal grounding—the two weakest perception tasks. Stronger embodiment-aware perception and explicit spatiotemporal reasoning are needed.

🧩 Planning Limitations Persist

Cross-embodiment: models trained mostly on single-arm settings fail to coordinate dual-arm actions or mobile manipulation. Cross-object: performance drops sharply on uncommon objects, symbolic reasoning, and world knowledge. Cross-view: multi-view inputs effectively recover performance when the front view is occluded, underscoring the value of multi-view reasoning.

⚙️ Execution-Failure Diagnosis Is Extremely Hard

Diagnosing execution-level errors is far harder than diagnosing planning-level ones: the best model reaches only 43.71 (most fall between 15 and 27), while planning-error diagnosis reaches 80.74; the human reference shows the same asymmetry (47.30 vs. 80.67). Execution-level diagnosis requires fine-grained spatial and physical understanding, e.g., separating position errors from rotation errors.

In-depth Analysis: VLM-VLA Consistency

RoboBench and VLA correlation heatmap

To test whether RoboBench captures abilities that matter for robot control, we convert several open-source VLM backbones into VLA policies with minimal fine-tuning and evaluate them on CALVIN and LIBERO-10.

RoboBench perception scores strongly align with long-horizon CALVIN performance: object-centric perception reaches r=0.884 and scene-centric perception reaches r=0.833. For LIBERO-10, the strongest signal shifts toward fine-grained interaction dynamics, where static+dynamic affordance prediction reaches r=0.677.
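
The r values above relate a RoboBench sub-score to downstream VLA success across backbones. Assuming they are Pearson correlation coefficients (the page does not name the estimator), a minimal sketch of the computation is shown below. The input numbers are toy values chosen for illustration only; they are not RoboBench, CALVIN, or LIBERO-10 results.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy data: one benchmark sub-score and one downstream metric per backbone.
toy_bench_scores = [30.0, 40.0, 50.0, 60.0]
toy_vla_metric = [1.0, 2.0, 2.5, 4.0]

r = pearson_r(toy_bench_scores, toy_vla_metric)
print(f"r = {r:.3f}")
```

With only a handful of backbones, a coefficient like this is a coarse signal, so it is best read alongside the per-task heatmap rather than as a standalone statistic.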

These correlations indicate that different VLA benchmarks depend on different cognitive skills. RoboBench therefore serves not only as a leaderboard, but also as a diagnostic tool for identifying which VLM capabilities are likely to transfer into downstream embodied policies.

Dataset Construction Pipeline

Dataset Construction Pipeline. RoboBench integrates open-source and self-collected robot data under a shared process (preprocessing → tool-assisted + human-in-the-loop annotation → unified schema → auto-generated QA) and builds datasets for five dimensions:
  • Instruction Comprehension: pair explicit instructions with LLM-rewritten implicit variants to test intent understanding.
  • Perception Reasoning: use captioning/detection/segmentation tools to draft labels across robotic/object/scene/task views, then human-refine and standardize.
  • Generalized Planning: construct a planning pool from robot videos; VLMs produce step/timestamp summaries and metadata, which are mapped to function templates to support Q1/Q2/Q3 evaluations.
  • Affordance Prediction: sample key frames and annotate static (contact points), dynamic (trajectories), and mobile (base positions) affordances.
  • Failure Analysis: mine execution-level failures from real trials and synthesize planning-level errors by perturbing correct instructions.
All outputs follow one schema and are rendered into binary, single-choice, and multi-step multiple-choice QA formats for open- and closed-source MLLMs.
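
To make the idea of one schema rendered into several QA formats concrete, here is a minimal sketch. The `QAItem` record, its field names, the prompt wording, and the example questions are all hypothetical; the released dataset may use a different layout. The sketch covers the binary and single-choice formats only.

```python
import re
from dataclasses import dataclass, field
from string import ascii_uppercase


@dataclass
class QAItem:
    """Hypothetical unified record; field names are illustrative only."""
    dimension: str
    task: str
    question: str
    options: list[str]
    answer_index: int
    image_paths: list[str] = field(default_factory=list)


def render_prompt(item: QAItem) -> str:
    """Render an item as a lettered choice question (binary = two options)."""
    lines = [item.question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(ascii_uppercase, item.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def is_correct(item: QAItem, reply: str) -> bool:
    """Accept replies such as 'B', 'b.', or '(B) Knife'; reject free text."""
    gold = ascii_uppercase[item.answer_index]
    match = re.match(r"\s*\(?([A-Za-z])\)?(?=[\s.:,)]|$)", reply)
    return match is not None and match.group(1).upper() == gold


# Made-up examples, not taken from the benchmark.
single_choice = QAItem(
    dimension="Perception Reasoning",
    task="Functional Attr.",
    question="Which object in the scene can be used to cut the bread?",
    options=["Spoon", "Knife", "Sponge"],
    answer_index=1,
    image_paths=["scene_front.jpg"],
)
binary = QAItem(
    dimension="Generalized Planning",
    task="Task state estimation",
    question="Has the drawer been fully opened?",
    options=["Yes", "No"],
    answer_index=0,
)

prompt = render_prompt(single_choice)
print(prompt)
print(is_correct(single_choice, "(B) Knife"), is_correct(binary, "No"))
```

Keeping the answer as an index into `options` lets the same record be re-rendered with shuffled options or as a yes/no question without touching the annotation itself.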

Planning Evaluation Pipeline

Planning Evaluation Framework. Evaluation of the planning dimension (Q1–Q3). Each task is decomposed into a sequence of parameterized atomic actions forming a Directed Acyclic Graph (DAG) that encodes causal and temporal dependencies. For Q1 (Long-horizon planning), an MLLM-based world simulator assesses both NodeCorrectness (action alignment) and TaskCompletion (goal-state achievement) by simulating action rollouts under visual and physical constraints. Q2 (Next-step planning) evaluates fine-grained step prediction by comparing skill, object, and parameter accuracy, while Q3 (Task state estimation) measures binary correctness on whether a subtask has been completed. Together, the pipeline provides a unified, interpretable framework for assessing structural correctness and embodied feasibility in planning.
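
The symbolic side of this framework can be illustrated with a small sketch. It shows a reference plan stored as a DAG of parameterized atomic actions, a check that a predicted ordering respects the dependencies, and simple Q2 and Q3 scoring. This is not the authors' evaluator: the actual Q1 scoring relies on an MLLM world simulator, which is not reproduced here, and the `Action` type, the drawer task, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A parameterized atomic action: skill, target object, optional parameter."""
    skill: str
    obj: str
    param: str = ""


# Hypothetical reference plan: put a cup into a drawer.
nodes = {
    "open": Action("open", "drawer"),
    "pick": Action("pick", "cup"),
    "place": Action("place", "cup", "drawer"),
    "close": Action("close", "drawer"),
}
# DAG edges as node -> set of prerequisite nodes.
deps = {"place": {"open", "pick"}, "close": {"place"}}


def respects_dag(order: list[str], deps: dict[str, set[str]]) -> bool:
    """True if every node appears only after all of its prerequisites."""
    seen: set[str] = set()
    for node in order:
        if not deps.get(node, set()) <= seen:
            return False
        seen.add(node)
    return True


def q2_score(pred: Action, gold: Action) -> dict[str, bool]:
    """Component-wise agreement for next-step prediction."""
    return {
        "skill": pred.skill == gold.skill,
        "object": pred.obj == gold.obj,
        "param": pred.param == gold.param,
    }


def q3_accuracy(preds: list[bool], golds: list[bool]) -> float:
    """Binary accuracy for 'has this subtask been completed?' judgments."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


# 'open' and 'pick' are independent, so either order is acceptable.
valid_a = respects_dag(["open", "pick", "place", "close"], deps)
valid_b = respects_dag(["pick", "open", "place", "close"], deps)
# Placing before picking violates a dependency.
invalid = respects_dag(["open", "place", "pick", "close"], deps)

step = q2_score(Action("place", "cup", "table"), nodes["place"])
acc = q3_accuracy([True, False, True], [True, True, True])
print(valid_a, valid_b, invalid, step, acc)
```

The two valid orderings show why a DAG is a more forgiving reference than a single gold sequence: steps without a mutual dependency may be swapped without being penalized, which exact text matching cannot express.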

BibTeX

@misc{luo2025robobenchcomprehensiveevaluationbenchmark,
                title={RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain},
                author={Yulin Luo and Chun-Kai Fan and Menghang Dong and Jiayu Shi and Xiangju Mi and Mengdi Zhao and Bo-Wen Zhang and Cheng Chi and Jiaming Liu and Gaole Dai and Rongyu Zhang and Ruichuan An and Kun Wu and Zhengping Che and Shaoxuan Xie and Guocai Yao and Zhongxia Zhao and Pengwei Wang and Guang Liu and Zhongyuan Wang and Tiejun Huang and Shanghang Zhang},
                year={2025},
                eprint={2510.17801},
                archivePrefix={arXiv},
                primaryClass={cs.RO},
                url={https://arxiv.org/abs/2510.17801},
                note={Accepted to ECCV 2026},
            }